Abstract

We consider the problem of simultaneous variable selection and constant 
coefficient identification in high-dimensional varying coefficient models based 
on B-spline basis expansion. Both objectives can be considered as some type 
of model selection problems and we show that they can be achieved by a dou- 
ble shrinkage strategy. We apply the adaptive group Lasso penalty in models 
involving a diverging number of covariates, which can be much larger than the 
sample size, but we assume the number of relevant variables is smaller than the 
sample size via model sparsity. Such so-called ultra-high dimensional settings 
are especially challenging in semiparametric models as we consider here and have
not been dealt with before. Under suitable conditions, we show that consistency 
in terms of both variable selection and constant coefficient identification can 
be achieved, as well as the oracle property of the constant coefficients. Even in 
the case that the zero and constant coefficients are known a priori, our results 
appear to be new in that the model reduces to semivarying coefficient models (a.k.a.
partially linear varying coefficient models) with a diverging number of covari- 
ates. We also theoretically demonstrate the consistency of a semiparametric 
BIC-type criterion in this high-dimensional context, extending several previous 
results. The finite sample behavior of the estimator is evaluated by some Monte 
Carlo studies. 

keywords: Adaptive Lasso; Extended BIC; B-spline basis; Semivarying 
coefficient models; Varying coefficient models; 
1 Introduction



Consider a varying coefficient model (Hastie and Tibshirani, 1993)

$$Y = X\beta_0(t) + \epsilon, \qquad (1)$$

where $X$ is an $n \times p$ covariate matrix, $\beta_0(t) = (\beta_{01}(t), \ldots, \beta_{0p}(t))^T$
is the vector of varying coefficients and $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T$ contains the
mean zero noises. For better model interpretation and efficient estimation, it is
desired to identify those irrelevant covariates ($\beta_j(t) = 0$) as well as covariates
associated with constant coefficients ($\beta_j(t) = c$ for some constant $c$). We allow
$p \gg n$ but the number of nonzero coefficients is smaller than $n$ while still
diverging to infinity.

For varying coefficient models, estimation can be performed based on local
polynomial regression, B-spline expansion, or smoothing splines (Fan and Zhang,
1999, 2000; Chiang et al., 2001; Huang et al., 2002, 2004; Eubank et al., 2004).
Local polynomial regression is a most popular approach, but it requires solving
many similar optimization problems on a fine grid on the support of the index
variable. Thus here we choose the B-spline expansion approach.

Shrinkage estimation for variable selection has attracted much attention recently,
with many contributions on the linear or parametric model (Tibshirani, 1996; Fan
and Li, 2001; Fan and Peng, 2004; Zou, 2006; Yuan and Lin, 2006, 2007; Zou and Li,
2008). Applying this approach to nonparametric or semiparametric problems is more
recent, probably starting with the COSSO method (Lin and Zhang, 2006; Zhang, 2006)
for nonparametric models. For varying coefficient models in particular, Wang and
Xia (2009) and Wang et al. (2008) studied the variable selection problem using
kernel regression and B-spline expansion respectively, when the dimensionality is
fixed. Extension to generalized semivarying coefficient models is presented in Li
and Liang (2008), where penalization is used for selecting predictors in the
parametric component only. Studies on constant coefficient identification are
comparatively scarce, and include Xia et al. (2004), which used cross-validation,
Huang et al. (2002) and Fan and Huang (2005), which used hypothesis testing, and
Leng (2009), which used penalization for identifying constant coefficients in the
context of smoothing splines. All of these works treat fixed dimensional problems.
Regularization methods for variable selection with a diverging dimensionality have
been investigated recently for additive models (Ravikumar et al., 2008; Meier et
al., 2009; Huang et al., 2010). For partially linear models, Xie and Huang (2009)
considered variable selection for the parametric component when the dimension
increases with sample size.

Based on the works mentioned above, selecting relevant variables and identifying
constant coefficients in a varying coefficient model is not a new problem, but our
goal here is more ambitious. First, we consider a diverging number of predictors
that can increase exponentially in sample size. Such a large dimension in
nonparametric models has only been used in additive models as mentioned in the
previous paragraph. For our varying coefficient models, which are semiparametric
(the model reduces to a semivarying coefficient model with both nonparametric and
parametric components, even when it is correctly specified), the situation is more
complicated. Second, we consider a regularization method for simultaneous variable
selection and constant coefficient identification. Given that our method can
achieve both goals, a semivarying coefficient model results. The asymptotic
property of semivarying coefficient models with a diverging dimensionality appears
to be new and of interest in itself, even without penalization. Third, we
introduce a semiparametric BIC-type criterion for automatically choosing the
regularization parameters. Consistency of BIC-type criteria in the regularization
framework for nonparametric models has only been shown in the case of fixed
dimension (Wang and Xia, 2009). Even for linear models, consistency has been
considered only in the case where $p$ increases polynomially in sample size (Chen
and Chen, 2008; Wang et al., 2009). All these make our theoretical investigations
very challenging, due to the high dimensionality and the double penalty.

Although other penalties such as SCAD can be used, here we choose the alter- 
native adaptive group Lasso penalty. The advantage is that the criterion function 
is convex and a global optimum is guaranteed. Convexity also means the first order 
KKT condition is both necessary and sufficient for optimality which is the key in 
our proofs. The rest of the article is organized as follows. In the next section, we 
present the estimation procedure using B-spline basis expansion and discuss some 
computational issues. Theoretical results are given in Section 3 with proofs relegated 
to the Appendix. Section 4 briefly discusses the choice of the initial estimator before
the adaptive group Lasso penalty can be applied. Section 5 contains some simulation 
studies used to illustrate the performance of the estimator, and we conclude in Section 
6. 



2 Penalized estimation with double adaptive Lasso 
penalty 

First we note that many quantities that appear in our exposition, including the
dimensionality $p$, implicitly depend on $n$. Let $(X_i, Y_i, t_i)$, $i = 1, \ldots, n$, be
independent and identically distributed observations from the varying coefficient
model (1) and for simplicity we assume the index variable $t$ has a distribution
supported on $[0, 1]$. We use polynomial splines to approximate the coefficients.
Let $0 = \xi_0 < \xi_1 < \cdots < \xi_{K'} < 1 = \xi_{K'+1}$ be a partition of $[0, 1]$ into
subintervals $[\xi_k, \xi_{k+1})$, $k = 0, \ldots, K'$, with $K'$ internal knots. We restrict our
attention to equally spaced knots, although a data-driven choice can be considered,
such as using the quantiles of the observed $t_i$. A polynomial spline of order $d'$
is a function whose restriction to each subinterval is a polynomial of degree
$d' - 1$ and which is globally $d' - 2$ times continuously differentiable on $[0, 1]$.
The collection of splines with a fixed sequence of knots has a normalized B-spline
basis $\{B_1(t), \ldots, B_K(t)\}$ with $K = K' + d'$. As in De Boor (2001), we also assume
that a linear combination of basis functions $\sum_{k=1}^K a_k B_k(t)$ is a constant $a$ if
and only if $a_1 = \cdots = a_K = a$, which can be achieved by making the boundary knots
have multiplicity $d'$ for example. Using spline expansions, we can approximate the
coefficients by $\beta_j(t) \approx \sum_{k=1}^K b_{jk} B_k(t)$. Note that it is possible to specify
different $K$ for each coefficient but we assume they are the same for simplicity.
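
The basis just described can be sketched with the Cox-de Boor recursion. This is only an illustration under stated assumptions: the function name `bspline_basis` and its arguments are hypothetical, the knots are equally spaced, and the boundary knots are repeated $d'$ times so that the basis functions sum to one on $[0, 1]$.

```python
def bspline_basis(t, n_internal, order):
    """Values at t in [0, 1] of the K = n_internal + order normalized B-splines.

    Assumes equally spaced internal knots and boundary knots of multiplicity
    `order` (an illustrative choice, not the only possible one).
    """
    inner = [k / (n_internal + 1) for k in range(1, n_internal + 1)]
    knots = [0.0] * order + inner + [1.0] * order
    # Order-1 (piecewise constant) splines; the last interval is closed at t = 1.
    vals = []
    for i in range(len(knots) - 1):
        left, right = knots[i], knots[i + 1]
        inside = left <= t < right or (t == 1.0 and left < right == 1.0)
        vals.append(1.0 if inside else 0.0)
    # The Cox-de Boor recursion raises the order one step at a time.
    for m in range(2, order + 1):
        new = []
        for i in range(len(knots) - m):
            a = knots[i + m - 1] - knots[i]
            b = knots[i + m] - knots[i + 1]
            term = 0.0
            if a > 0:
                term += (t - knots[i]) / a * vals[i]
            if b > 0:
                term += (knots[i + m] - t) / b * vals[i + 1]
            new.append(term)
        vals = new
    return vals


def spline_value(coefs, t, n_internal, order):
    """Evaluate sum_k coefs[k] * B_k(t)."""
    basis = bspline_basis(t, n_internal, order)
    return sum(c * v for c, v in zip(coefs, basis))
```

With `order=4` and `n_internal=6` this gives $K = 10$ basis functions, and a coefficient vector with all entries equal to $a$ evaluates to the constant $a$, which is the property the second penalty relies on.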

We are especially interested in a sparse model where many of the coefficients
$\beta_{0j}$ are zeros, and in addition some coefficients are non-varying constants. To
fix ideas, we assume the first $p_1$ coefficients are truly varying, the next $p_2$
coefficients are constants and the rest are zeros, and let $s = p_1 + p_2 < p$ be the
total number of nonzero coefficients. In order to automatically identify those
special coefficients, we propose the following penalized least squares estimation
procedure

$$\hat b = \arg\min_b \frac{1}{2} \sum_{i=1}^n \Big(Y_i - \sum_{j=1}^p \sum_{k=1}^K X_{ij} b_{jk} B_k(t_i)\Big)^2 + n\lambda_1 \sum_{j=1}^p w_{1j} \|b_j\| + n\lambda_2 \sum_{j=1}^p w_{2j} \|b_j\|_c, \qquad (2)$$

where $\lambda_1, \lambda_2$ are regularization parameters, and $w_1 = (w_{11}, \ldots, w_{1p})$ and
$w_2 = (w_{21}, \ldots, w_{2p})$ are two given vectors of weights, which need to be
appropriately chosen in order to achieve consistency in model selection. One
possible choice of these weights is obtained from an initial estimator based on
the group Lasso penalty (that is, equation (2) with weights equal to 1), resulting
in a globally two-step approach in estimating the coefficients. Some discussions
on the initial estimator are provided in Section 4 and for now we assume the
weights are already given. For the penalty terms in (2), $\|a\| = (\sum_{k=1}^K a_k^2)^{1/2}$
is the $l_2$ norm of any $K$-dimensional vector $a$ and $\|a\|_c = (\sum_{k=1}^K (a_k - \bar a)^2)^{1/2}$
with $\bar a = \sum_{k=1}^K a_k / K$. We note that the first penalty is used for identifying
zero coefficients while the second is used for identifying constant coefficients,
since $\|b_j\|_c = 0$ if and only if $b_{j1} = \cdots = b_{jK}$. For future reference, we remark
that $\|a\|_c$ is actually the Euclidean distance from $a$ to the linear subspace
$L = \{b\mathbf{1}, b \in R\}$, where $\mathbf{1}$ is the vector with all components ones, and can thus
be written equivalently as $\|Q_L a\|$ with $Q_L$ the $K \times K$ matrix representing the
projection onto the orthogonal complement of $L$.
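
As a small illustration of the two norms used in the penalties, the following sketch computes $\|a\|$ and $\|a\|_c$ and applies the projection onto the orthogonal complement of the constant vectors. The function names are hypothetical and chosen only for this example.

```python
import math


def l2_norm(a):
    """Euclidean norm ||a||, used in the penalty that detects zero coefficients."""
    return math.sqrt(sum(x * x for x in a))


def project_out_constants(a):
    """Apply Q_L: subtract the mean, i.e. project onto the complement of span{1}."""
    mean = sum(a) / len(a)
    return [x - mean for x in a]


def centered_norm(a):
    """||a||_c, the distance from a to the constant vectors; zero iff a is constant."""
    return l2_norm(project_out_constants(a))
```

For a constant vector such as `[2.0, 2.0, 2.0]` the centered norm is zero while the Euclidean norm is not, which is why the two penalties can separate constant coefficients from zero coefficients.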

The minimization problem can be solved by locally quadratic approximation as
suggested in Fan and Li (2001), Wang et al. (2008) and Wang and Xia (2009), which
is by now a rather well-known and standard algorithm. Using the notations

$$Z_j = \begin{pmatrix} X_{1j}B_1(t_1) & X_{1j}B_2(t_1) & \cdots & X_{1j}B_K(t_1) \\ \vdots & \vdots & & \vdots \\ X_{nj}B_1(t_n) & X_{nj}B_2(t_n) & \cdots & X_{nj}B_K(t_n) \end{pmatrix}_{n \times K},$$

$Z = (Z_1, \ldots, Z_p)$, $Y = (Y_1, \ldots, Y_n)^T$, (2) can be written in matrix form as

$$\hat b = \arg\min_b \frac{1}{2}\|Y - Zb\|^2 + n\lambda_1 \sum_{j=1}^p w_{1j}\|b_j\| + n\lambda_2 \sum_{j=1}^p w_{2j}\|b_j\|_c. \qquad (3)$$

The locally quadratic approximation approach iteratively solves

$$\arg\min_b \frac{1}{2}\|Y - Zb\|^2 + n\lambda_1 \sum_{j=1}^p w_{1j}\|b_j\|^2/\|b_j^{(0)}\| + n\lambda_2 \sum_{j=1}^p w_{2j}\|b_j\|_c^2/\|b_j^{(0)}\|_c$$

with $b^{(0)}$ the current estimate. However, with double penalties, we need to keep
track of both zero coefficients as well as constant coefficients during the
iterative process, making the implementation slightly more complicated than usual.
The details are omitted here.
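
Since the implementation details are omitted, the following is only a minimal sketch of one locally quadratic approximation update, not the authors' code. It assumes the usual quadratic majorizer $\|b\| \le \|b^{(0)}\|/2 + \|b\|^2/(2\|b^{(0)}\|)$ for both penalties, so each update solves a ridge-type linear system. It guards small norms with a hypothetical tolerance `eps` instead of tracking zero and constant groups explicitly. All function names are illustrative.

```python
import math


def solve(A, y):
    """Solve A x = y by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def group_norms(block):
    """Return (||b_j||, ||b_j||_c) for one group of K spline coefficients."""
    mean = sum(block) / len(block)
    norm = math.sqrt(sum(v * v for v in block))
    cnorm = math.sqrt(sum((v - mean) ** 2 for v in block))
    return norm, cnorm


def objective(Z, Y, b, K, lam1, lam2, w1, w2):
    """Doubly penalized least squares criterion in matrix form."""
    n = len(Z)
    rss = sum((Y[i] - sum(Z[i][a] * b[a] for a in range(len(b)))) ** 2 for i in range(n))
    pen = 0.0
    for j in range(len(b) // K):
        norm, cnorm = group_norms(b[j * K:(j + 1) * K])
        pen += n * lam1 * w1[j] * norm + n * lam2 * w2[j] * cnorm
    return 0.5 * rss + pen


def lqa_step(Z, Y, b0, K, lam1, lam2, w1, w2, eps=1e-8):
    """One update: solve (Z'Z + D) b = Z'Y with D built from the current b0."""
    n, q = len(Z), len(Z[0])
    A = [[sum(Z[i][a] * Z[i][c] for i in range(n)) for c in range(q)] for a in range(q)]
    rhs = [sum(Z[i][a] * Y[i] for i in range(n)) for a in range(q)]
    for j in range(q // K):
        norm, cnorm = group_norms(b0[j * K:(j + 1) * K])
        c1 = n * lam1 * w1[j] / max(norm, eps)
        c2 = n * lam2 * w2[j] / max(cnorm, eps)
        for a in range(K):
            for c in range(K):
                # Block c1 * I + c2 * Q_L, where Q_L = I - 1 1' / K.
                A[j * K + a][j * K + c] += (c1 + c2) * (a == c) - c2 / K
    return solve(A, rhs)
```

Because the quadratic surrogate majorizes the penalty and touches it at the current estimate, each step cannot increase the criterion as long as the current group norms exceed `eps`. In practice one would iterate `lqa_step` until the change in `b` is negligible and then threshold tiny norms to exact zeros or constants.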

In practice, we need to choose some parameters including the spline order $d'$, the
number and positions of the knots of the spline basis as well as the two
regularization parameters. To ease the computational burden, we fix $d' = 4$ and
$K = 10$ with an equally spaced knot sequence in our implementation and choose only
$\lambda_1$ and $\lambda_2$ based on data. This strategy is well known in the functional
smoothing/functional data analysis literature, where the number of knots is chosen
to be sufficiently large to reduce bias in function approximation since the
variance can be effectively controlled by subsequent penalization (see for example
Chapter 5 of Ramsay and Silverman (2005) for a detailed illustration of this
effect in the functional smoothing context). It is also possible to position the
knots based on sample quantiles of the observed index variable, but since choosing
optimal knots is not the focus of the paper we will only use equally spaced knots
for simplicity.

We use a BIC-type criterion to select simultaneously $\lambda_1$ and $\lambda_2$, given by

$$BIC_\lambda = \log\Big(\frac{1}{n}\|Y - Z\hat b_\lambda\|^2\Big) + d_1 \frac{\log n}{n} C_n + d_2 \frac{\log(n/K)}{n/K} C_n, \qquad (4)$$

where $\hat b_\lambda$ is the minimizer of (3) given $\lambda = (\lambda_1, \lambda_2)$, $d_1$ is the number of
coefficients estimated as nonzero constants and $d_2$ is the number of coefficients
estimated as truly varying. We will show later that the BIC is consistent in model
selection if $C_n = \Omega(\sqrt{\log(pK)})$ and $C_n \log(n/K)/(n/K) \to 0$ under some additional
assumptions, where the notation $a_n = \Omega(b_n)$ means $b_n = O(a_n)$. We will use
$C_n = \sqrt{\log(pK)}$ in our simulations, which produces reasonable results. Although it
is unsatisfactory that $C_n$ must be chosen in a somewhat arbitrary way, the same
problem appeared in Wang et al. (2009), in which some arbitrary value (among many
possibilities) that satisfies their theoretical conditions is picked and its
performance is verified using Monte Carlo examples. We will refer to the criterion
(4) with $C_n = \sqrt{\log(pK)}$ as the extended BIC (EBIC) following Chen and Chen (2008),
while with $C_n = 1$ we obtain the ordinary BIC.
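
The criterion can be evaluated directly once a fit is available. The sketch below assumes the residual sum of squares `rss` and the counts `d1` (nonzero constant coefficients) and `d2` (truly varying coefficients) have already been extracted from the fitted model. The function name and argument names are hypothetical.

```python
import math


def bic_type_criterion(rss, n, K, p, d1, d2, extended=True):
    """BIC-type criterion with C_n = sqrt(log(pK)) (EBIC) or C_n = 1 (ordinary BIC).

    rss : residual sum of squares ||Y - Z b||^2 of the fitted model
    d1  : number of coefficients estimated as nonzero constants
    d2  : number of coefficients estimated as truly varying
    """
    c_n = math.sqrt(math.log(p * K)) if extended else 1.0
    m = n / K  # effective sample size per spline coefficient
    return (math.log(rss / n)
            + d1 * math.log(n) / n * c_n
            + d2 * math.log(m) / m * c_n)
```

In use, one would evaluate this function over a grid of $(\lambda_1, \lambda_2)$ values and keep the pair with the smallest value. With $n = 100$, $K = 10$ and $p = 50$, the extended version penalizes each selected coefficient more heavily than the ordinary one because $\sqrt{\log(500)} > 1$.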



3 Asymptotic results 

We first introduce the following notations. Let $Z^{(1)}$ be the $n \times p_1 K$ submatrix
of $Z$ containing the columns corresponding to truly varying coefficients, and
similarly let $Z^{(2)}$ be the submatrix corresponding to constant coefficients and
$Z^{(3)}$ the submatrix corresponding to zero coefficients. In the same spirit, we can
define $X^{(1)}, X^{(2)}, X^{(3)}$ as suitable submatrices of $X$, with the corresponding
random variables denoted by $x^{(1)}, x^{(2)}, x^{(3)}$. Similar notations are also applied
to vectors $b$ and $\beta(t)$.

Let $\mathcal{G}$ denote the subspace of functions on $R^{p_1} \times [0, 1]$

$$\mathcal{G} := \Big\{ g(x^{(1)}, t) : g(x^{(1)}, t) = x^{(1)T} h(t),\ h(t) = (h_1(t), \ldots, h_{p_1}(t))^T \text{ with some functions } h_j(t) \text{ and } E\Big(\sum_{j=1}^{p_1} x_j^{(1)} h_j(t)\Big)^2 < \infty \Big\},$$

and for any random variable $w$ with $E(w^2) < \infty$, let $E_{\mathcal{G}}(w)$ denote the projection
of $w$ onto $\mathcal{G}$ in the sense that

$$E\{(w - E_{\mathcal{G}}(w))^2\} = \inf_{g \in \mathcal{G}} E\{(w - g(x^{(1)}, t))^2\}.$$

The definition of $E_{\mathcal{G}}(w)$ trivially extends to the case where $w$ is a random vector
by componentwise projection.

In the theoretical studies of our estimator, we will use the decomposition

$$x^{(2)} = \theta(x^{(1)}, t) + u = \theta(x^{(1)}, t) - g(x^{(1)}, t) + g(x^{(1)}, t) + u, \qquad (5)$$

with $\theta(x^{(1)}, t) = E(x^{(2)} | x^{(1)}, t)$ and $g(x^{(1)}, t) = E_{\mathcal{G}}(x^{(2)})$. Note that since the
conditional expectation $E(x^{(2)} | x^{(1)}, t)$ can be interpreted as projection onto the
space $\{h(x^{(1)}, t), Eh^2 < \infty\}$ of which $\mathcal{G}$ is a subspace, we see that we also have
$g(x^{(1)}, t) = E_{\mathcal{G}}(\theta(x^{(1)}, t))$. Let $\Xi = E\{(x^{(2)} - g(x^{(1)}, t))(x^{(2)} - g(x^{(1)}, t))^T\}$, which
can be considered as the residual variance of $x^{(2)}$ after projection.






For the adaptive group Lasso penalty in (3), the weights $w_{1j}$, $s + 1 \le j \le p$, are
associated with the zero coefficients and $w_{2j}$, $p_1 + 1 \le j \le p$, are associated with
constant (including zero) coefficients. Asymptotically, these weights do not appear
in the convergence rates if we can consistently select the true model. Thus it
makes sense for our asymptotic investigation to define $\|w_1'\| = (\sum_{j=1}^s w_{1j}^2)^{1/2}$
and $\|w_2'\| = (\sum_{j=1}^{p_1} w_{2j}^2)^{1/2}$, which will appear in the convergence rates.

First we consider the case where covariates corresponding to zero and constant
coefficients are known to us. In this case, we have a "regularized oracle
estimator" $(\hat b^{(1)}, \hat\beta^{(2)})$ obtained from minimizing the following functional

$$Q(b^{(1)}, \beta^{(2)}) = \frac{1}{2}\|Y - Z^{(1)} b^{(1)} - X^{(2)} \beta^{(2)}\|^2 + n\lambda_1 \sum_{j=1}^{p_1} w_{1j} \|b_j^{(1)}\| + n\lambda_2 \sum_{j=1}^{p_1} w_{2j} \|b_j^{(1)}\|_c + n\lambda_1 \sqrt{K} \sum_{j=p_1+1}^{s} w_{1j} |\beta_j^{(2)}|, \qquad (6)$$

where $b^{(1)}$ is a $p_1 K$ dimensional vector corresponding to the truly varying
coefficients and $\beta^{(2)} = (\beta^{(2)}_{p_1+1}, \ldots, \beta^{(2)}_s)^T$ are the constant coefficients. The
extra $\sqrt{K}$ in the penalty above is due to the fact that $\|b_j\| = \sqrt{K}|\beta_j|$ when
$b_{j1} = \cdots = b_{jK} = \beta_j$.

We will consider rates of convergence as well as asymptotic normality of the
resulting estimator. Note that our results for the minimizer of (6) cover the
unpenalized case $\lambda_1 = \lambda_2 = 0$ and thus provide some asymptotic analysis of
semivarying coefficient models with diverging dimensionality, which is of
independent interest.

The conditions required for our theoretical results on the regularized oracle
estimator are listed here.

(c1) The covariates have finite fourth moments, $\max_j E X_{ij}^4 < \infty$, and the
eigenvalues of the second moment matrix of the covariates associated with
nonzero coefficients are bounded away from zero and infinity.

(c2) The noises are independent of covariates, have mean zero, variance $\sigma^2$, and
finite fourth moment.

(c3) The index variable $t$ has a density bounded away from zero and infinity on $[0, 1]$.

(c4) For $1 \le j \le p_1$, $\beta_{0j}(t)$ satisfies a Lipschitz condition of order $d > 1/2$:
$|\beta_{0j}^{(\lfloor d \rfloor)}(t) - \beta_{0j}^{(\lfloor d \rfloor)}(s)| \le C|s - t|^{d - \lfloor d \rfloor}$, where $\lfloor d \rfloor$ is the biggest integer
strictly smaller than $d$ and $\beta_{0j}^{(\lfloor d \rfloor)}(t)$ is the $\lfloor d \rfloor$-th derivative of $\beta_{0j}(t)$. The
order of the B-spline used satisfies $d' \ge d + 2$.

(c5) $Ks/n \to 0$, $s/K^{2d} \to 0$, $(\lambda_1^2\|w_1'\|^2 + \lambda_2^2\|w_2'\|^2)K \to 0$.

(c6) The eigenvalues of $\Xi$ are bounded away from zero and infinity.

(c7) In the decomposition (5), each component of $g(x^{(1)}, t)$ can be written in the
form $\sum_{j=1}^{p_1} x_j^{(1)} h_j(t)$ for some $h_j$. We assume all $h_j$ satisfy a Lipschitz
condition of order $d_g > 1/2$: $|h_j^{(\lfloor d_g \rfloor)}(t) - h_j^{(\lfloor d_g \rfloor)}(s)| \le C|s - t|^{d_g - \lfloor d_g \rfloor}$. The
order of the B-spline used satisfies $d' \ge d_g + 2$.

(c8) Ks^/n 0, sVir^'is ^ 0, and ^K-^'^+'^a^ 0. 

In condition (c1), we only require that the eigenvalues of the second moment matrix
of covariates associated with nonzero coefficients are bounded away from zero and
infinity. Conditions (c2)-(c4) are standard. The convergence rate (7) below would
be void without condition (c5). Other conditions are used in showing the faster
convergence rate of the parametric component in (6), which is the more difficult
part of the proof. (c6) and (c7) imply that $x^{(2)}$ is not in $\mathcal{G}$ and its projection
onto $\mathcal{G}$ is smooth enough. These conditions are similar to Assumption (A2) and
Condition 1 in Xie and Huang (2009) respectively for high-dimensional partially
linear models.

From the rates obtained below, if $\lambda_1 = \lambda_2 = 0$ (or small enough), the optimal number
of knots in the spline expansion is $K \sim n^{1/(2d+1)}$ as usual.

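Theorem 1 (Convergence rates) Under conditions (c1)-(c5), the nonparametric
component $\hat b^{(1)}$ of the minimizer of (6) satisfies

where $b^0$ is any vector satisfying $\|\beta_{0j}(t) - \sum_k b^0_{jk} B_k(t)\| = O(K^{-d})$. As an
immediate corollary,

$$\sum_{j=1}^{p_1} \|\hat\beta_j^{(1)}(t) - \beta_{0j}(t)\|^2 = O_p\Big(\frac{Ks}{n} + \frac{s}{K^{2d}} + (\lambda_1^2\|w_1'\|^2 + \lambda_2^2\|w_2'\|^2)K\Big), \qquad (7)$$

where $\beta_{0j}(t)$ denotes the true coefficients and $\hat\beta_j^{(1)}(t) = \sum_k \hat b_{jk}^{(1)} B_k(t)$.

For the parametric part, under additional assumptions (c6)-(c8), we have the
faster rate

$$\sum_{j=p_1+1}^{s} |\hat\beta_j^{(2)} - \beta_{0j}|^2 = O_p\Big(\frac{s}{n} + (\lambda_1^2\|w_1'\|^2 + \lambda_2^2\|w_2'\|^2)K\Big).$$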

The following conditions are assumed for asymptotic normality of the parametric 
component. 
(c9) s/K"^ ^ 0, \f^K-^'^+^^^ 0. 



(clO) ^nKs{Xi\\w[\\ + \2\\w'2\\) 0. 

Theorem 2 (Asymptotic normality) Let $A_n$ be a deterministic $m \times p_2$ matrix with
$m$ an integer that does not change with $n$, and $\Sigma_n = A_n \Xi^{-1} A_n^T$ ($\Xi$ is defined
below (5)). Under conditions (c1)-(c10),

$$\sqrt{n}\, \Sigma_n^{-1/2} A_n (\hat\beta^{(2)} - \beta_0^{(2)}) \to N(0, \sigma^2 I_m) \text{ in distribution},$$

where $I_m$ is the $m \times m$ identity matrix.

We will now show that the estimator from (3) is exactly equal to the regularized
oracle estimator from (6) with probability converging to 1. In particular, this
immediately gives the same convergence rates as well as asymptotic normality as in
Theorems 1 and 2 for the estimator even when the positions of the zero and constant
coefficients are unknown. In order for the adaptive group Lasso estimator to
identify the correct model, we need to make sure the weights $w_{1j}$, $s + 1 \le j \le p$,
associated with zero coefficients and the weights $w_{2j}$, $p_1 + 1 \le j \le p$, associated
with constant coefficients are big enough to force sufficient penalty. The
following two conditions make this requirement exact. Our conditions are stated
for direct use in the proof of the theorem and seem complicated. We will make the
conditions more explicit in Section 4 and show that these conditions can be
naturally satisfied.

(cll) ./^{./hg]^+./K^T^^I/K^+V^ 
l<j<P- 

(cl2) ,/^{^\og{pK) + ^Ks + ns/K2'^+v^(Ai| im I+A2I im I)} = o(nAiWi,-), s+ 
^<j<P- 

Theorem 3 Assume conditions (c11) and (c12) as well as those in Theorem 1. Suppose
$(\hat b^{(1)}, \hat\beta^{(2)})$ solves the problem (6). Define $\hat b = (\hat b^{(1)}, \hat b^{(2)}, \hat b^{(3)})$ with
$\hat b^{(2)}_{jk} = \hat\beta^{(2)}_j$, $p_1 + 1 \le j \le s$, $1 \le k \le K$, and $\hat b^{(3)}_{jk} = 0$, $s + 1 \le j \le p$,
$1 \le k \le K$. Then with probability approaching 1, $\hat b$ is the solution of the
original problem (3). As a corollary, the rates of convergence of $\hat b$ are the same
as those stated in Theorem 1, and asymptotic normality of the estimated constant
coefficients holds under the additional conditions assumed in Theorem 2.

Finally, we consider the consistency of the BIC-type criterion. Since we consider
ultra-high dimensional problems here with $p \gg n$, for technical reasons, we will
assume that the number of nonzero coefficients $s = O(1)$ does not increase with $n$,
and that we only select among potential models with dimension upper bounded by a
known integer $S$. Although restrictive in some situations, this assumption is
satisfied, say, when we know that only a small number of predictors are relevant
even as we collect more predictors as sample size increases, and we have an a
priori bound on the number of relevant covariates. In the case of parametric
models, even with $p$ only increasing polynomially in $n$, Chen and Chen (2008) also
make this assumption. We need the following conditions.

(c13) Both $\inf_{1 \le j \le p_1} \|\beta_{0j}(t)\|_c$ and $\inf_{p_1+1 \le j \le s} |\beta_{0j}|$ are bounded away from zero.



(c14) $K \sim n^{1/(2d+1)}$, $C_n = \Omega(\sqrt{\log(pK)})$, $C_n \log(n/K)/(n/K) \to 0$.

Theorem 4 Suppose the number of nonzero coefficients $s$ does not increase with $n$,
and we only consider models with at most $S$ (which also does not increase with $n$)
nonzero coefficients, with $s \le S$. Under conditions (c13) and (c14), in addition
to those assumed in Theorem 1 and Theorem 3, the BIC-type criterion (4) will
correctly identify the nonzero coefficients and the constant coefficients with
probability approaching 1.



4 Initial estimator with Lasso penalty 

In the adaptive Lasso penalty, conditions (c11) and (c12) require that the weight
$w_{1j}$ is large for zero coefficients and small for nonzero ones, and similar
requirements for $w_{2j}$ are imposed. Following Zou (2006), where the adaptive Lasso
was first proposed, we set $w_{1j} = 1/\|\tilde b_j\|$ and $w_{2j} = 1/\|\tilde b_j\|_c$ using an initial
estimator $\tilde b$ obtained by minimizing the least squares criterion with the group
Lasso penalty

$$\tilde b = \arg\min_b \frac{1}{2}\|Y - Zb\|^2 + n\lambda_0 \sum_{j=1}^p \|b_j\|.$$

Note that to obtain the initial estimator, it is only necessary to use a single
penalty term.
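
The weights defined in this section can be computed group by group from any initial coefficient vector. The sketch below is illustrative: the function name is hypothetical, the initial estimate is assumed to be stored as a flat list with $K$ consecutive entries per covariate, and a zero norm is mapped to an infinite weight, which is one simple convention for excluding that group.

```python
import math


def adaptive_weights(b_init, K):
    """Return (w1, w2) with w1[j] = 1/||b_j|| and w2[j] = 1/||b_j||_c.

    A group with zero norm receives weight math.inf (an assumed convention),
    so the corresponding coefficient stays zero, or constant, in the second step.
    """
    w1, w2 = [], []
    for j in range(len(b_init) // K):
        block = b_init[j * K:(j + 1) * K]
        mean = sum(block) / K
        norm = math.sqrt(sum(v * v for v in block))
        cnorm = math.sqrt(sum((v - mean) ** 2 for v in block))
        w1.append(1.0 / norm if norm > 0 else math.inf)
        w2.append(1.0 / cnorm if cnorm > 0 else math.inf)
    return w1, w2
```

A group estimated as exactly constant by the initial fit gets a finite first weight but an infinite second weight, which matches the remark that mistakes of the initial estimator in the "too sparse" direction cannot be undone later.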

Theorem 5 Under conditions (c1)-(c5), if $\lambda_0 > C\sqrt{s \log(pK)/n}$ for sufficiently
large $C > 0$, all coefficients except $Ms$ of them are estimated as zeros, where $M$
is a finite constant $M > 1$. In addition, we have the convergence rate

where $b^0$ contains the coefficients in the optimal approximation of $\beta_{0j}$, $1 \le j \le s$,
in the spline basis expansion.

Compared with Theorem 1, the extra factor $s \log(pK)$ in the convergence rate arises
because we do not have a priori knowledge of the nonzero components as in
Theorem 1, and the logarithmic factor turns out to be the resulting cost (also see
the proof of Theorem 3, where similar logarithmic factors appear in conditions
(c11) and (c12)).

Equipped with the initial estimator which gives us the weights in (|3]), we will 
demonstrate that various conditions imposed in the previous section can be satis- 
fied. First we fix Aq = C a/s \og{pK) / n and K ~ 7^1/(2(^+1)^ Then the convergence 
rate of ||6 — 6|| in Theorem [5] for the group Lasso estimator is K'^s'^ \og{pK) / n = 
o{\fK) if we assume K \og{pK) / n — )■ 0, which is stronger than (c8). Suppose 
that condition (cl2) on the true coefficients is satisfied, then the weights satisfy 
wij = 0(1/ VK), 1 < j < s and wij = n{^n/{K^s^\og{pK))), s + 1 < j < p. Simi- 
larly W2j = 0{1/\/K), I < j < Pi and W2j = Vl{^J n / {K'^ s"^ \og{pK))),pi + 1 < j < p. 

If Ai, A2 = 0(v/^), then (A?| p + AsI |m Hi^^ = 0{K'^s/n) and thus the last 
term in the convergence rate of — 6|p in Theorem [1] can be ignored. If furthermore 

Ai,A2 = o(^), (8) 

then condition (clO) is satisfied. 

To fix ideas, suppose now $\log p = n^q$ with $0 < q < 1$. Conditions (c11) and (c12)
impose that

Ai,A2>>max| , j. (9) 

n n 

If $s = O(1)$ (although not necessary), there exist $\lambda_1, \lambda_2$ that satisfy both (8) and
(9) if $q < d/(2d+1)$.

To make the initial estimator effectively usable as weights, the regularization
parameter $\lambda_0$ must be large enough so that many zero coefficients are correctly
identified, but small enough that it still obtains reasonable convergence rates. We
do not have corresponding theoretical results on how to choose $\lambda_0$ based on data.
In our simulations, we use both ordinary BIC and EBIC to select this smoothing
parameter. It is found that while EBIC is better at identifying the correct model
when using the group Lasso penalty, BIC is more desirable in this initial step
when considering our adaptive group Lasso penalty.

5 Simulation 

In this section we use some simulations to evaluate the finite sample performance of 
the adaptive group Lasso in variable selection and constant coefficient identification. 
The datasets are generated from model (1) with sample size $n = 100$ and noises
$\epsilon_i \sim N(0, 0.1)$. The index variable $t$ is sampled uniformly on $[0, 1]$, and the
predictors are $X_{i1} = 1$ with the other $X_{ij}$'s marginally standard normal with
within subject correlations $Cov(X_{ij_1}, X_{ij_2}) = (1/2)^{|j_1 - j_2|}$. The first three
coefficient functions are truly varying with

$$\beta_1(t) = 3\sin(2\pi t), \quad \beta_2(t) = 8t(1 - t), \quad \beta_3(t) = \cos[(2\pi t)^2].$$

There are 6 constant coefficients specified as $\beta_4 = \beta_5 = 1.5$, $\beta_6 = \beta_7 = 0.5$ and
$\beta_8 = \beta_9 = 0.1$. All other coefficients are set to be zero. Since we focus on
high-dimensional models here, we consider both $p = 50$ and $p = 150$. For both
scenarios, 500 datasets are generated and fitted. We compare adaptive group Lasso
with group Lasso and also compare the effects of using ordinary BIC with extended
BIC. We fix the number of spline basis functions $K$ to be 10, which is sufficiently
flexible to approximate the varying coefficients. For the group Lasso estimator,
we use both BIC and EBIC for model identification. We also consider the adaptive
group Lasso estimator when the group Lasso estimator (using ordinary BIC) is used
as the initial estimator, with $\lambda_1$ and $\lambda_2$ chosen by either ordinary BIC or EBIC.
In Table 1, we show the number of identified zero and constant coefficients by
different methods, with the information criterion used in each case indicated in
brackets. For example, the row indicated as aglasso(BIC-EBIC) shows the results
for the adaptive group Lasso estimator when BIC is used in choosing $\lambda_0$ for the
initial group Lasso estimator and EBIC is used in choosing smoothing parameters
for the final estimator. We see that when EBIC is used for the initial estimator,
some nonzero coefficients are incorrectly identified as zeros. Note that these
mistakes cannot be corrected by the subsequent adaptive group Lasso estimator. On
the other hand, if BIC is used for the initial estimator, although many zero
coefficients are identified as varying, these mistakes can be corrected by the
final estimator. This is why we do not consider the combinations EBIC-BIC and
EBIC-EBIC for the final estimator in our simulations. Another important conclusion
to be drawn from the table is that model selection using BIC-EBIC is better than
using BIC-BIC. For example, when $p = 50$, the number of zero coefficients is 41 and
on average 40.26 of them are identified using BIC-EBIC while only 36.84 of them
are identified using BIC-BIC (i.e., more false positives). BIC-EBIC also works
better for identifying the constant coefficients.

Table 1: Model selection results of different estimators based on 500 replications, with
n = 100.

                               Avg # of zero coefficients   Avg # of const. coefficients
                               correct      incorrect       correct      incorrect
p = 50    glasso(BIC)            5.94
          glasso(EBIC)          40.78         4.72
          aglasso(BIC-BIC)      36.84         0.01            3.98         3.12
          aglasso(BIC-EBIC)     40.26         0.02            5.54         0.74
p = 150   glasso(BIC)           68.17         0.07
          glasso(EBIC)         127.36         2.1
          aglasso(BIC-BIC)     133.17         0.07            3.5          6.8
          aglasso(BIC-EBIC)    139.9          0.37            4.43         1.13

In Table 2 we present the estimation errors (in $L_2$ norm) for some of the
coefficients. Note that based on the true model, $\beta_1, \beta_2, \beta_3$ are varying
coefficients, $\beta_4, \beta_6, \beta_8$ are constants and $\beta_{10}$ is actually zero. We also show in
the last column of the table the estimation error of the oracle estimator where
the true model is known and no penalization is used, with $K$ selected by the GCV
criterion (note that here we need to choose $K$ based on data since there is no
subsequent penalization that reduces the variance of the estimator if $K$ is fixed
to be sufficiently large). From the table, we see that the adaptive group Lasso
estimator in general performs better than the group Lasso estimator, and for the
adaptive group Lasso estimator, using BIC-BIC and BIC-EBIC produces similar
results (note that this is in terms of estimation error only, and BIC-EBIC is
better for identifying the true model).
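
The simulation design of this section can be sketched as follows. Several choices here are assumptions of the sketch rather than statements from the text: the noise parameter 0.1 is treated as a variance, the correlated predictors are generated by a first-order autoregressive recursion (which yields correlation $(1/2)^{|j_1 - j_2|}$ and unit variance), and the three varying functions are transcribed from a partly illegible display, so their exact forms should be checked against the original. Function names are hypothetical.

```python
import math
import random


def beta(t, p):
    """True coefficient vector beta_0(t) of length p (requires p >= 9)."""
    b = [0.0] * p
    b[0] = 3.0 * math.sin(2.0 * math.pi * t)   # varying (assumed form)
    b[1] = 8.0 * t * (1.0 - t)                 # varying
    b[2] = math.cos((2.0 * math.pi * t) ** 2)  # varying (assumed form)
    b[3:9] = [1.5, 1.5, 0.5, 0.5, 0.1, 0.1]    # six constant coefficients
    return b


def simulate(n=100, p=50, noise_var=0.1, seed=0):
    """Draw one dataset (X, Y, T) from the varying coefficient model."""
    rng = random.Random(seed)
    rho = 0.5
    X, Y, T = [], [], []
    for _ in range(n):
        t = rng.random()
        # First predictor is the intercept; the second starts the AR(1) chain.
        row = [1.0, rng.gauss(0.0, 1.0)]
        for _ in range(p - 2):
            innovation = rng.gauss(0.0, 1.0)
            row.append(rho * row[-1] + math.sqrt(1.0 - rho ** 2) * innovation)
        signal = sum(x * c for x, c in zip(row, beta(t, p)))
        y = signal + rng.gauss(0.0, math.sqrt(noise_var))
        X.append(row)
        Y.append(y)
        T.append(t)
    return X, Y, T
```

Calling `simulate(n=100, p=50)` and `simulate(n=100, p=150)` with different seeds would reproduce the two scenarios in spirit; with $p = 50$ the coefficient vector has 41 zero entries, in line with the count quoted above.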



6 Conclusion 

In this paper we proposed an estimation method for identifying zero coefficients and 
constant coefficients simultaneously for high-dimensional varying coefficient models. 
The high dimensionality and the double penalties used to achieve both goals make 
the theoretical analysis harder than previously proposed models. We demonstrated 
convergence rates and asymptotic normality of the constant coefficients, and proposed 
a semiparametric BIC as a consistent model selection tool.

One possible extension of the current work is to consider generalized varying
coefficient models. Variable selection for such models has been considered in Li
and Liang (2008) based on local linear regression for fixed dimension. However, in
their procedure,
dure, undersmoothing of the varying coefficients is necessary for efficient estimation of 
the parametric component. It is expected that such undersmoothing is not necessary 
for spline based method that estimates both components simultaneously. 
Table 2: Estimation errors of different estimators based on 500 replications, with
n = 100.

                              glasso               aglasso           oracle
                          BIC       EBIC      BIC-BIC   BIC-EBIC
p = 50    $\beta_1$      0.0860    1.2534     0.0441     0.0445      0.0399
          $\beta_2$      0.1076    1.3801     0.0542     0.0671      0.0465
          $\beta_3$      0.1461    0.5752     0.0671     0.0773      0.0491
          $\beta_4$      0.1078    1.3779     0.0361     0.0197      0.0148
          $\beta_6$      0.0792    0.4718     0.0295     0.0196      0.0171
          $\beta_8$      0.0460    0.0998     0.0364     0.0242      0.0153
          $\beta_{10}$   0.0188    0.0003     0.0060     0.0023      0.0000
p = 150   $\beta_1$      0.1568    0.5449     0.0571     0.0635
          $\beta_2$      0.1541    0.5415     0.0894     0.0926
          $\beta_3$      0.2452    0.3879     0.1221     0.1401
          $\beta_4$      0.1540    0.5295     0.0439     0.0364
          $\beta_6$      0.1001    0.2164     0.0387     0.0326
          $\beta_8$      0.0557    0.0814     0.0493     0.0492
          $\beta_{10}$   0.0129    0.0058     0.0032     0.0022
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Appendix 



In some of the proofs below we will make use of some simple properties of the
subdifferential and thus we first mention these properties here. For a vector $b$, the
subdifferential of its $\ell_2$ norm is

$$\partial\|b\| = \begin{cases} b/\|b\| & \text{if } b \neq 0, \\ \text{some } a \text{ with } \|a\| \le 1 & \text{if } b = 0. \end{cases}$$

Note that when $b = 0$ the subdifferential is not unique but we still use $\partial\|b\|$ to
denote some subdifferential since its specific value has no significance in our proofs.
Slightly more generally, for any matrix $A$,

$$\partial\|Ab\| = \begin{cases} A^TAb/\|Ab\| & \text{if } Ab \neq 0, \\ A^Ta \text{ for some } a \text{ with } \|a\| \le 1 & \text{if } Ab = 0. \end{cases}$$
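
The two subdifferential formulas can be checked numerically. The following Python sketch is illustrative only and is not part of the paper: the helper names are hypothetical, the choice $a = 0$ at $Ab = 0$ is just one admissible element, and the centering matrix is used as an example of $A$ under the assumption that $\|b\|_c$ is the norm of the centered vector.

```python
import math

def matvec(A, b):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, b)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def subgradient_norm_Ab(A, b, tol=1e-10):
    """Return one element of the subdifferential of b -> ||A b||."""
    Ab = matvec(A, b)
    size = norm(Ab)
    if size > tol:
        # Differentiable case: A^T A b / ||A b||.
        return [x / size for x in matvec(transpose(A), Ab)]
    # Ab = 0: any A^T a with ||a|| <= 1 works; a = 0 is the simplest choice.
    return [0.0] * len(b)

def centering_matrix(K):
    """I - e e^T / K: projection onto the orthogonal complement of e = (1,...,1)."""
    return [[(1.0 if i == j else 0.0) - 1.0 / K for j in range(K)]
            for i in range(K)]

def satisfies_subgradient_inequality(A, b, g, y, tol=1e-10):
    """Check ||A y|| >= ||A b|| + g^T (y - b), the defining property of a subgradient."""
    lhs = norm(matvec(A, y))
    rhs = norm(matvec(A, b)) + sum(gi * (yi - bi) for gi, yi, bi in zip(g, y, b))
    return lhs >= rhs - tol
```

With `A = centering_matrix(K)`, a vector whose entries are all equal falls in the second branch, which is the situation of a constant coefficient in the proofs below.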



Proof of Theorem 1. The convergence rate for the nonparametric component
is relatively easy to show. Instead of showing the rates for the regularized oracle
estimator, we consider the minimizer $\hat b$ of the following functional

$$Q'(b) = \frac{1}{2}\|Y - Zb\|^2 + n\lambda_1\sum_{j=1}^{s} w_{1j}\|b_j\| + n\lambda_2\sum_{j=1}^{p_1} w_{2j}\|b_j\|_c,$$

where only for the proof of Theorem 1 we set $Z = (Z^{(1)}, Z^{(2)})$. That is, one knows the
zero coefficients but does not constrain the truly constant coefficients to be constants.
This makes the notation simpler. The convergence of the regularized oracle estimator
follows exactly the same lines.

Suppose $\beta_{nj}(t) = \sum_{k=1}^{K} b^0_{jk}B_k(t)$ is the best approximating spline for $\beta_{0j}(t)$ with
$\|\beta_{nj} - \beta_{0j}\|^2 = O(K^{-2d})$. By the definition of $\hat b$, we have

$$\begin{aligned}
0 &\ge Q'(\hat b) - Q'(b^0) \\
&\ge \|Y - Z\hat b\|^2/2 - \|Y - Zb^0\|^2/2 - n\lambda_1\sum_{j=1}^{s} w_{1j}\|\hat b_j - b^0_j\| - n\lambda_2\sum_{j=1}^{p_1} w_{2j}\|\hat b_j - b^0_j\| \\
&= (Y - Zb^0)^TZ(b^0 - \hat b) + \|Z(b^0 - \hat b)\|^2/2 - n\lambda_1\sum_{j=1}^{s} w_{1j}\|\hat b_j - b^0_j\| - n\lambda_2\sum_{j=1}^{p_1} w_{2j}\|\hat b_j - b^0_j\|,
\end{aligned}$$

where in the second inequality above we used the property $\|a\|_c \le \|a\|$ for any vector
$a$.
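
The last equality is the usual expansion of a quadratic loss around $b^0$, and the inequality $\|a\|_c \le \|a\|$ holds because centering is a projection. A small Python sketch, not from the paper, checks both on random inputs. The function names are hypothetical, and `centered_norm` assumes that $\|a\|_c$ is the $\ell_2$ norm of $a$ after subtracting its mean.

```python
import math
import random

def matvec(Z, b):
    return [sum(z * x for z, x in zip(row, b)) for row in Z]

def sq_norm(v):
    return sum(x * x for x in v)

def centered_norm(a):
    """Assumed form of ||a||_c: the l2 norm of a minus its mean."""
    mean = sum(a) / len(a)
    return math.sqrt(sum((x - mean) ** 2 for x in a))

def loss_gap(Y, Z, b_hat, b0):
    """||Y - Z b_hat||^2 / 2 - ||Y - Z b0||^2 / 2, computed directly."""
    r_hat = [y - f for y, f in zip(Y, matvec(Z, b_hat))]
    r0 = [y - f for y, f in zip(Y, matvec(Z, b0))]
    return 0.5 * sq_norm(r_hat) - 0.5 * sq_norm(r0)

def expanded_gap(Y, Z, b_hat, b0):
    """(Y - Z b0)^T Z (b0 - b_hat) + ||Z (b0 - b_hat)||^2 / 2."""
    r0 = [y - f for y, f in zip(Y, matvec(Z, b0))]
    d = matvec(Z, [u - v for u, v in zip(b0, b_hat)])
    return sum(r * x for r, x in zip(r0, d)) + 0.5 * sq_norm(d)

def random_instance(n, q, seed=0):
    """Arbitrary Gaussian test data; purely for the numerical check."""
    rng = random.Random(seed)
    Z = [[rng.gauss(0, 1) for _ in range(q)] for _ in range(n)]
    Y = [rng.gauss(0, 1) for _ in range(n)]
    b_hat = [rng.gauss(0, 1) for _ in range(q)]
    b0 = [rng.gauss(0, 1) for _ in range(q)]
    return Y, Z, b_hat, b0
```

For any instance, `loss_gap` and `expanded_gap` agree up to rounding error, and `centered_norm(v)` never exceeds the plain norm of `v`.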
Let $\eta = P_Z(Y - Zb^0)$, where $P_Z = Z(Z^TZ)^{-1}Z^T$, be the projection of $Y - Zb^0$
onto the columns of $Z$, then Lemma 1 shows that $\|\eta\|^2 = O_p(Ks + ns/K^{2d})$. Using
the Cauchy-Schwarz inequality, the above displayed equation can be continued as

$$\ge -|O_p(Ks + ns/K^{2d})| + \frac{1}{4}\|Z(b^0 - \hat b)\|^2 - n\lambda_1\sum_{j=1}^{s} w_{1j}\|\hat b_j - b^0_j\| - n\lambda_2\sum_{j=1}^{p_1} w_{2j}\|\hat b_j - b^0_j\|. \tag{10}$$

Using now Lemma A.1 in Wang et al. (2008) together with condition (c1), which
implies that $\|Z(b^0 - \hat b)\|^2 \sim (n/K)\|b^0 - \hat b\|^2$, and using the Cauchy-Schwarz inequality
$n\lambda_1\sum_{j=1}^{s} w_{1j}\|\hat b_j - b^0_j\| \le (CKn/4)\sum_j(\lambda_1w_{1j})^2 + (n/(CK))\|b^0 - \hat b\|^2$ with a sufficiently
large $C > 0$ (similarly for $n\lambda_2\sum_j w_{2j}\|\hat b_j - b^0_j\|$), (10) implies
$\|\hat b - b^0\|^2 = O_p(K^2s/n + s/K^{2d-1} + (\lambda_1^2\sum_{j=1}^{s} w_{1j}^2 + \lambda_2^2\sum_{j=1}^{p_1} w_{2j}^2)K^2)$.
The convergence rate for $\sum_j\|\hat\beta_j - \beta_{0j}\|^2$ is obtained from the well-known relation
$\|\sum_k a_kB_k(t)\|^2 \sim \|a\|^2/K$ for any $a = (a_1, \ldots, a_K)$.

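Now consider the faster convergence rate of the parametric components in the
regularized oracle estimator, which we show by profiling out $b^{(1)}$ in (6). For any given
$\beta$, let $\hat b(\beta)$ be the minimizer of (6) when $\beta$ is fixed. Again, for ease of notation, we
write $b^{(1)}$ simply as $b$, $\beta^{(2)}$ as $\beta$, $Z^{(1)}$ as $Z$, and $X^{(2)}$ as $X$. By the KKT condition,
we know that $\hat b(\beta)$ satisfies

$$-Z_j^T(Y - Zb - X\beta) + n\lambda_1w_{1j}\partial\|b_j\| + n\lambda_2w_{2j}\partial\|b_j\|_c = 0, \quad j = 1, \ldots, p_1.$$

From the above expression we get

$$\hat b(\beta) = (Z^TZ)^{-1}Z^T(Y - X\beta) + (Z^TZ)^{-1}v(\beta), \tag{11}$$

where $v(\beta)$ is a $p_1K$-dimensional vector with its $j$-th component given by
$n\lambda_1w_{1j}\partial\|\hat b_j(\beta)\| + n\lambda_2w_{2j}\partial\|\hat b_j(\beta)\|_c$ (the sign convention for $v(\beta)$ plays no
role below, since only its norm enters the bounds).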



Let $\beta_0$ be the true parameter and set $\beta = \beta_0 + \gamma_1u$ with
$\gamma_1 = C(\sqrt{s/n} + \sqrt{K}(\lambda_1\|w_1'\| + \lambda_2\|w_2'\|))$ for some $C > 0$, and $\|u\| = 1$. We will show
that $\inf_{\|u\|=1} Q(\hat b(\beta), \beta) - Q(\hat b(\beta_0), \beta_0) > 0$ with probability approaching 1 for $C$
large enough and the result will follow.






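Using the closed form expression for $\hat b(\beta)$, we get

$$\begin{aligned}
& Q(\hat b(\beta), \beta) - Q(\hat b(\beta_0), \beta_0) \\
={}& -(\tilde Y - \tilde X\beta_0)^T(\gamma_1\tilde Xu + Z(Z^TZ)^{-1}v(\beta)) + (1/2)\|\gamma_1\tilde Xu + Z(Z^TZ)^{-1}v(\beta)\|^2 \\
& + (\tilde Y - \tilde X\beta_0)^TZ(Z^TZ)^{-1}v(\beta_0) - (1/2)\|Z(Z^TZ)^{-1}v(\beta_0)\|^2 \\
& + n\lambda_1\sum_{j=1}^{p_1} w_{1j}\|\hat b_j(\beta)\| + n\lambda_1\sum_{j=p_1+1}^{s} w_{1j}\sqrt{K}|\beta_j| + n\lambda_2\sum_{j=1}^{p_1} w_{2j}\|\hat b_j(\beta)\|_c \\
& - n\lambda_1\sum_{j=1}^{p_1} w_{1j}\|\hat b_j(\beta_0)\| - n\lambda_1\sum_{j=p_1+1}^{s} w_{1j}\sqrt{K}|\beta_{0j}| - n\lambda_2\sum_{j=1}^{p_1} w_{2j}\|\hat b_j(\beta_0)\|_c,
\end{aligned} \tag{12}$$

where for any random matrix $W$ with $n$ rows, we set $\tilde W = Q_ZW = W - P_ZW$ to be
the projection of columns of $W$ onto the orthogonal complement of the column space
of $Z$, where $P_Z = Z(Z^TZ)^{-1}Z^T$.

Using that $Z(Z^TZ)^{-1}v$ is inside the column space of $Z$, while all variables with
a tilde are orthogonal to it, the first four terms in (12) are simplified to

$$-(\tilde Y - \tilde X\beta_0)^T(\gamma_1\tilde Xu) + (1/2)\|\gamma_1\tilde Xu\|^2 + (1/2)\|Z(Z^TZ)^{-1}v(\beta)\|^2 - (1/2)\|Z(Z^TZ)^{-1}v(\beta_0)\|^2.$$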



In Lemma 2 (i)-(iii), we show that $\|(\tilde Y - \tilde X\beta_0)^T(\tilde Xu)\| = O(\sqrt{ns})$,
$\|Z(Z^TZ)^{-1}v(\beta_0)\| = O(\sqrt{nK}(\lambda_1\|w_1'\| + \lambda_2\|w_2'\|))$, and the last two lines in (12)
involving the penalty terms are of order
$O(n\sqrt{K}\lambda_1\|w_1'\|\gamma_1 + nK(\lambda_1^2\|w_1'\|^2 + \lambda_2^2\|w_2'\|^2))$. Since the eigenvalues
of $\tilde X^T\tilde X/n$ are bounded away from zero by Lemma 2 (iv) and condition (c6),
$Q(\hat b(\beta), \beta) - Q(\hat b(\beta_0), \beta_0)$ is bounded below by

$$nc\gamma_1^2 + O(a_n)\gamma_1 + O(b_n),$$

for some $c > 0$ and some positive sequences $a_n$, $b_n$, the exact expression of which we
choose not to write down explicitly. Thus if $\gamma_1 = C\max\{a_n/n, \sqrt{b_n/n}\}$ for $C > 0$
sufficiently large, the above displayed expression will be positive. The expression
$\max\{a_n/n, \sqrt{b_n/n}\}$ is exactly of order $\sqrt{s/n} + \sqrt{K}(\lambda_1\|w_1'\| + \lambda_2\|w_2'\|)$ as in the
statement of the Theorem. □
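
To see why a large enough $C$ makes the lower bound positive, write $m = \max\{a_n/n, \sqrt{b_n/n}\}$, so that $a_n \le nm$ and $b_n \le nm^2$. Taking both $O(\cdot)$ terms with unit constants and the least favourable sign, the bound at $\gamma_1 = Cm$ is at least $nm^2(cC^2 - C - 1)$. The Python sketch below illustrates this under those simplifying assumptions. The function names and the unit constants are illustrative, not from the paper.

```python
import math

def gamma_choice(a_n, b_n, n, C):
    """gamma_1 = C * max{a_n / n, sqrt(b_n / n)}."""
    return C * max(a_n / n, math.sqrt(b_n / n))

def worst_case_bound(gamma, a_n, b_n, n, c):
    """n*c*gamma^2 - a_n*gamma - b_n: the lower bound with both O(.) terms
    given unit constants and the least favourable (negative) sign."""
    return n * c * gamma ** 2 - a_n * gamma - b_n

def sufficient_C(c):
    """Positive root of c*C^2 - C - 1 = 0; any larger C makes the bound positive."""
    return (1.0 + math.sqrt(1.0 + 4.0 * c)) / (2.0 * c)
```

For example, with `c = 0.5` any `C` above `sufficient_C(0.5)` (about 2.73) gives a positive value of `worst_case_bound`, whereas `C = 1` with `a_n = 10`, `b_n = 1`, `n = 100` gives a negative one.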

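Proof of Theorem 2. As in the proof of Theorem 1, $Z^{(1)}$, $X^{(2)}$ are simply written as
$Z$ and $X$ here. By the KKT condition, in addition to

$$-Z_j^T(Y - Z\hat b - X\hat\beta) + n\lambda_1w_{1j}\partial\|\hat b_j\| + n\lambda_2w_{2j}\partial\|\hat b_j\|_c = 0, \quad j = 1, \ldots, p_1, \tag{13}$$

which has been used in the proof of Theorem 1, we also have that $(\hat b, \hat\beta)$ satisfies

$$-X_j^T(Y - Z\hat b - X\hat\beta) + n\lambda_1\sqrt{K}w_{1j}\partial|\hat\beta_j| = 0, \quad j = p_1 + 1, \ldots, s. \tag{14}$$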
Write $Y = r' + X\beta_0 + e$ where $r' = (r'_1, \ldots, r'_n)^T$ with $r'_i = \sum_{j=1}^{p_1} X_{ij}\beta_{0j}(t_i)$, denote
by $b^0$ the vector containing the spline coefficients that achieve optimal approximation
of $\beta_{0j}(t)$, $1 \le j \le p_1$, and set $a = r' - Zb^0$. Then (14) is rewritten as

$$-X_j^T(e + a - Z(\hat b - b^0) - X(\hat\beta - \beta_0)) + n\lambda_1w_{1j}\sqrt{K}\partial|\hat\beta_j| = 0, \quad j = p_1 + 1, \ldots, s.$$

From (11), we get $Z(\hat b - b^0) = Z(Z^TZ)^{-1}Z^T(e + a - X(\hat\beta - \beta_0)) + Z(Z^TZ)^{-1}v$ ($v = v(\hat\beta)$
defined right after equation (11)), and plugging this into the above displayed equation we get

$$-X_j^T(e + a - Z(Z^TZ)^{-1}[Z^T(e + a - X(\hat\beta - \beta_0)) + v] - X(\hat\beta - \beta_0)) + n\lambda_1w_{1j}\sqrt{K}\partial|\hat\beta_j| = 0, \quad j = p_1 + 1, \ldots, s,$$

that is,

$$-X_j^T(\tilde e + \tilde a - \tilde X(\hat\beta - \beta_0) - Z(Z^TZ)^{-1}v) + n\lambda_1w_{1j}\sqrt{K}\partial|\hat\beta_j| = 0, \quad j = p_1 + 1, \ldots, s,$$

from which we get

= y^S-V2A„(X^X)-iX^(e + a) + ^Y.'^'^ A^{X^ Xy^X^ Z{Z^ Z)-\ 
+x/^E;1/2^„(X^X)-1A, (15) 

where A is a p2 "dimensional vector with components given by n\\W\j^J~Kd\P>j\^j = 
Pi + l,...,s. By Lemma |2] (iv), we can replace (X'^X/n)^'^ by which only 
results in a multiplicative factor 1 + o(l) and thus does not disturb the asymptotic 
distribution. 

It is easily shown 

||^s-V2A„s-i|| = 0(v^). 

Combining this with ||X"^a|| = 0{^/nsjK^ + ns / K'^'^^'^^^) (combining bounds (1211) - 
([23]) in Lemma |2](i) ), WX"^ Z{Z^ Z)-^v\\ = 0{ny/K{Xi\\w[\\ + AsimiD) (Lemma [2] 
(ii) ) and ||A|| = 0{nXi\/K\\w[\\) , and conditions (c9)(cl0), all terms in f|T5]) are o(l) 
except y^En^^^A„(X^X)-iX^e, which can be shown to converge to A^(0, a^I) by 
Lindeberg-Feller central limit theorem using standard arguments. □. 

Proof of Theorem 3. Since $(\hat b^{(1)}, \hat\beta^{(2)})$ solves the optimization problem (6), we have
that

$$-Z_j^T(Y - Z^{(1)}\hat b^{(1)} - X^{(2)}\hat\beta^{(2)}) + n\lambda_1w_{1j}\partial\|\hat b^{(1)}_j\| + n\lambda_2w_{2j}\partial\|\hat b^{(1)}_j\|_c = 0, \quad j = 1, \ldots, p_1, \tag{16}$$

$$-X_j^T(Y - Z^{(1)}\hat b^{(1)} - X^{(2)}\hat\beta^{(2)}) + n\lambda_1w_{1j}\sqrt{K}\partial|\hat\beta^{(2)}_j| = 0, \quad j = p_1 + 1, \ldots, s. \tag{17}$$

We remind the readers that the equations above actually mean "there exists some
subdifferential that makes the left hand side zero" in case the subdifferential is not
unique.

In order to show that the $pK$-dimensional vector $\hat b = (\hat b^{(1)}, \hat b^{(2)}, \hat b^{(3)})$ with
$\hat b^{(2)}_{jk} = \hat\beta^{(2)}_j$, $j = p_1 + 1, \ldots, s$, $k = 1, \ldots, K$ and $\hat b^{(3)}_{jk} = 0$, $s + 1 \le j \le p$, $1 \le k \le K$ solves (3),
we only need to verify the corresponding KKT conditions,

$$-Z_j^T(Y - Z^{(1)}\hat b^{(1)} - Z^{(2)}\hat b^{(2)} - Z^{(3)}\hat b^{(3)}) + n\lambda_1w_{1j}\partial\|\hat b_j\| + n\lambda_2w_{2j}\partial\|\hat b_j\|_c = 0, \quad j = 1, \ldots, p. \tag{18}$$

First, for $1 \le j \le p_1$, (18) trivially follows from (16), since
$Z^{(2)}\hat b^{(2)} + Z^{(3)}\hat b^{(3)} = X^{(2)}\hat\beta^{(2)}$.

Next, for $p_1 + 1 \le j \le s$, (18) is implied by the following two results.

(a) the $K$-dimensional vector $-Z_j^T(Y - Z^{(1)}\hat b^{(1)} - Z^{(2)}\hat b^{(2)} - Z^{(3)}\hat b^{(3)}) + n\lambda_1w_{1j}\partial\|\hat b_j\|$
is orthogonal to $e := (1, 1, \ldots, 1)^T$.

(b) $\|Z_j^T(Y - Z^{(1)}\hat b^{(1)} - Z^{(2)}\hat b^{(2)} - Z^{(3)}\hat b^{(3)})\| + n\lambda_1w_{1j} \le n\lambda_2w_{2j}$.

In fact, (a) implies that $-Z_j^T(Y - Z^{(1)}\hat b^{(1)} - Z^{(2)}\hat b^{(2)} - Z^{(3)}\hat b^{(3)}) + n\lambda_1w_{1j}\partial\|\hat b_j\| = Q_1a$
($Q_1$ is the matrix of projection onto the orthogonal complement of $e$ as defined in
Section 2) for some $a$, and (b) implies that $\|Q_1a\| \le n\lambda_2w_{2j}$ and thus we can find
a version of $a$ with $\|a\| \le n\lambda_2w_{2j}$. If we choose the subdifferential $\partial\|\hat b_j\|_c$ to be
$-Q_1a/(n\lambda_2w_{2j})$ (note that this is indeed a subdifferential since $\|\hat b_j\|_c = 0$ when
$p_1 + 1 \le j \le s$) then equation (18) is verified.

For verifying (a), we can set $\partial\|\hat b_j\| = (s_j, \ldots, s_j)^T/\sqrt{K}$ where $s_j = \partial|\hat\beta^{(2)}_j|$ in (17)
(it can be verified that $(s_j, \ldots, s_j)^T/\sqrt{K}$ is indeed a subdifferential). With this choice
of $\partial\|\hat b_j\|$, it can be easily checked that
$e^T\{-Z_j^T(Y - Z^{(1)}\hat b^{(1)} - Z^{(2)}\hat b^{(2)} - Z^{(3)}\hat b^{(3)}) + n\lambda_1w_{1j}\partial\|\hat b_j\|\}$
is exactly equal to the left hand side of (17) and thus equal to zero,
which immediately implies (a).
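
The verification of (a) rests on three elementary facts about a coefficient block with $K$ equal entries $\beta$: its norm is $\sqrt{K}|\beta|$, its centered norm vanishes, and $e^T(s, \ldots, s)^T/\sqrt{K} = \sqrt{K}s$. The Python sketch below checks these numerically. It is an illustration rather than part of the proof: the function names are hypothetical, and `centered_norm` assumes that $\|b\|_c$ is the norm of $b$ after projection onto the orthogonal complement of $e$.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def centered_norm(v):
    """Assumed form of ||v||_c: norm of v projected orthogonally to e = (1,...,1)."""
    mean = sum(v) / len(v)
    return math.sqrt(sum((x - mean) ** 2 for x in v))

def constant_block(beta, K):
    """Coefficient block representing a constant coefficient: all K entries equal beta."""
    return [beta] * K

def sign_subgradient(beta):
    """One element of the subdifferential of |beta| (0 is chosen at beta = 0)."""
    if beta == 0:
        return 0.0
    return math.copysign(1.0, beta)

def block_subgradient(beta, K):
    """(s, ..., s) / sqrt(K) with s a subgradient of |beta|."""
    s = sign_subgradient(beta)
    return [s / math.sqrt(K)] * K
```

With `K = 4` and `beta = -2.0`, the block has norm 4, centered norm 0, and the entries of `block_subgradient(-2.0, 4)` sum to -2, that is $\sqrt{K}s$ with $s = -1$. This matches the factor $\sqrt{K}$ multiplying $\partial|\hat\beta_j|$ in (17).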

For verifying (b), we have
$\|Z_j^T(Y - Z^{(1)}\hat b^{(1)} - Z^{(2)}\hat b^{(2)} - Z^{(3)}\hat b^{(3)})\| \le \|Z_j^Te\| + \|Z_j^T(Z^{(1)}(\hat b^{(1)} - b^0) + X^{(2)}(\hat\beta^{(2)} - \beta_0))\| + \|Z_j^T(r' - Z^{(1)}b^0)\|$
($r'$, $b^0$ defined in the proof of Theorem 2). Using exactly the same arguments as in
Theorem 1 of Huang et al. (2010+), we have $\max_j\|Z_j^Te\| = O(\sqrt{(n/K)\log(pK)})$. Besides, it is
easy to see (using Theorem 1) that
$\|Z_j^T(Z^{(1)}(\hat b^{(1)} - b^0) + X^{(2)}(\hat\beta^{(2)} - \beta_0))\| + \|Z_j^T(r' - Z^{(1)}b^0)\| = O(\sqrt{(n/K)(Ks + ns/K^{2d} + nK(\lambda_1^2\|w_1'\|^2 + \lambda_2^2\|w_2'\|^2))})$
and (b) is verified by condition (c11) (condition (c11) also implies
$\lambda_1\|w_1'\| = o(\lambda_2w_{2j})$, $p_1 + 1 \le j \le s$).

Finally, for $s + 1 \le j \le p$ in (18), we only need to verify that
$\|Z_j^T(Y - Z^{(1)}\hat b^{(1)} - Z^{(2)}\hat b^{(2)} - Z^{(3)}\hat b^{(3)})\| \le n\lambda_1w_{1j}$, $s + 1 \le j \le p$, which follows exactly the
same arguments as in verifying (b) above and the details are omitted. □



Proof of Theorem 4. For any given pair of regularization parameters $\lambda = (\lambda_1, \lambda_2)$,
we denote by $\hat b_\lambda$ the minimizer of (3), and by $\hat b$ the minimizer when the optimal
sequence of regularization parameters is chosen such that $\hat b$ results in a consistent
model selection. We separately consider several different cases below. For each case,
we implicitly assume that all previous cases do not happen since they have already
been dealt with.

Case 1. Some truly varying coefficients are estimated as constant or zero coefficients
in $\hat b_\lambda$. Similar to the calculations performed in the proof of Theorem 1 we have

$$\frac{1}{2n}\|Y - Z\hat b_\lambda\|^2 - \frac{1}{2n}\|Y - Z\hat b\|^2 \ge -\frac{1}{n}\|P_Z(Y - Z\hat b)\|^2 + \frac{1}{4n}\|Z(\hat b - \hat b_\lambda)\|^2.$$

Since there is some $j$ for which $\hat b_j$ represents a truly varying coefficient with
convergence rate given by Theorem 1 while $\hat b_{\lambda j}$ has all $K$ components equal to each other
representing a constant coefficient, it is easy to show that
$\|Z(\hat b - \hat b_\lambda)\|^2/n \ge \|Z_j(\hat b_j - \hat b_{\lambda j})\|^2/n$ is bounded away from zero by condition (c13). Besides,
$\|P_Z(Y - Z\hat b)\|^2/n = o(1)$ (using the same arguments as in Lemma 1 as well as the proof of
convergence rate in Theorem 1) and the penalty terms in BIC are all of order $o(1)$, thus the BIC
when $\lambda$ is used is bigger than the BIC when the optimal regularization sequence is
used (following the same arguments as in the proof of Theorem ? in ?).

Case 2. Some nonzero constant coefficients are estimated as zeros in $\hat b_\lambda$. This also
represents an underfitted model and is dealt with similarly as in Case 1.

Case 3. Some zero or constant coefficients are estimated as truly varying in $\hat b_\lambda$.
Let $b^*$ be the minimizer of the least squares criterion $\|Y - Zb\|^2$ under the additional
constraint that the model identified by $\hat b_\lambda$ is used when minimizing the least squares
criterion. We have that

$$\begin{aligned}
& \frac{1}{2n}\|Y - Z\hat b_\lambda\|^2 - \frac{1}{2n}\|Y - Z\hat b\|^2 \\
\ge{}& \frac{1}{2n}\|Y - Zb^*\|^2 - \frac{1}{2n}\|Y - Z\hat b\|^2 \\
={}& \frac{1}{n}(Y - Z\hat b)^TZ(\hat b - b^*) + \frac{1}{2n}\|Z(\hat b - b^*)\|^2 \\
\ge{}& \frac{1}{n}(Y - Z\hat b)^TZ(\hat b - b^*).
\end{aligned} \tag{19}$$

By the definition of $b^*$ and the fact that we only search over models with size $O(s)$,
the convergence rate of $b^*$ can be obtained using similar arguments as Theorem 1 but
without the terms involving $\lambda_1$ and $\lambda_2$ appearing. Arguments similar to those used
in showing result (b) in the proof of Theorem 3 can be used to show that (19) is
bounded below by a negative term whose absolute value is of order

-^{nslogipK) + ^) . (-- + ^) = 0( ), 

which is of order smaller than the BIG penalty term \og{n / K) / {n / K)Cn when C„ = 
VL{^\og{pK)) (note we assume s = 0(1)). That BIG cannot select such A can now 
be derived by standard arguments. 

Case 4. Some zero coefficients are estimated as nonzero constants. This case is
similar to the previous one and the details are omitted. □



Proof of Theorem 5. We only sketch the proof here. First using the general
results in Wei and Huang (2007), which deal with linear models with group Lasso
penalty, we can show that at most $O(s)$ covariates are selected if $\lambda_n > \sqrt{s\log(pK)/n}$.
The only difference of our case from that of Wei and Huang (2007) is the necessity
of an approximation of coefficient functions by spline expansions. However, this
problem can be solved by following exactly the same lines in the proof of Theorem 1 of
Huang et al. (2010+), using the bound for $\|r - Zb^0\|$ in Lemma 1. The rest of the
proof on convergence rate follows the same strategy as in Theorem 1.



Lemma 1 Following notations defined in the proof of Theorem 1, $\|\eta\|^2 = \|P_Z(Y - Zb^0)\|^2$
is of order $O_p(Ks + ns/K^{2d})$.

Proof. Denote $r_i = \sum_{j=1}^{s} X_{ij}\beta_{0j}(t_i)$ and $r = (r_1, \ldots, r_n)^T$. We have $Y - Zb^0 =
e + (r - Zb^0)$ and $\|\eta\|^2 \le 2\|P_Ze\|^2 + 2\|r - Zb^0\|^2$. By the approximation property of
splines, $\|r - Zb^0\|^2 = O_p(ns/K^{2d})$. Also, $E\|P_Ze\|^2 = E(e^TP_Ze) = \sigma^2\mathrm{tr}(P_Z) = O(sK)$
and the lemma is proved by an application of Markov inequality. □

We collect several miscellaneous results on bounding some terms used in the proof
of Theorem 1 and Theorem 2 in the following Lemma.

Lemma 2 Following the notations used in Theorem 1 and Theorem 2, we have

(i) $\|(\tilde Y - \tilde X\beta_0)^T\tilde X\| = O(\sqrt{ns})$.

(ii) $\|Z(Z^TZ)^{-1}v(\beta_0)\| = O(\sqrt{nK}(\lambda_1\|w_1'\| + \lambda_2\|w_2'\|))$.

(iii) The last two lines in (12) are of order $O(n\sqrt{K}\lambda_1\|w_1'\|\gamma_1 + nK(\lambda_1^2\|w_1'\|^2 + \lambda_2^2\|w_2'\|^2))$.

(iv) $\|\tilde X^T\tilde X/n - \Xi\| = o(1)$ where $\|B\|$ for a matrix $B$ denotes its Frobenius norm.
Proof.



(i) We first write down the decomposition

$$X = \Theta - G + G + U$$

(note we follow the notation in Theorems 1 and 2 and write $X^{(2)}$ simply as $X$). The
above uppercase letters represent $n \times p_2$ matrices, and correspond to the
decomposition in (5) evaluated at $n$ observations. After projection, we have

$$\tilde X = \tilde\Theta - \tilde G + \tilde G + \tilde U.$$

Together with the decomposition

$$\tilde Y - \tilde X\beta_0 = Q_Ze + Q_Z(r' - Zb^0)$$

(same as in the proof of Theorem 2, $r' = (r'_1, \ldots, r'_n)^T$ with $r'_i = \sum_{j=1}^{p_1} X_{ij}\beta_{0j}(t_i)$, and $b^0$
contains the spline coefficients that achieve optimal approximation of $\beta_{0j}(t)$, $1 \le j \le
p_1$), the bound for $\|(\tilde Y - \tilde X\beta_0)^T\tilde X\|$ is obtained from the following estimates.

\\e^QzX\\ = 0(7^), (20) 

\\{r'-ZbYQz{e-G)\\ = yi^, (21) 



{r'-ZbrQzU\\ = Jli, (22) 



Wir'-ZbYQzGW = ^J^,^^ = 0{V^s), (23) 

where fl20l) is obvious from condition (cl), fl2T]) is based on that entries of B — G have 
mean zero and are orthogonal to Q while entries of (r' — Zlpy ^i-nd Z are inside Q and 
thus we can calculate the bound by considering its variance, fl2^ is obtained similarly, 
and finally (1251) is obtained from HQ^GH < HGH = 0{^/nsjK^) and conditions (c8). 

(ii) Obviously $\|Z(Z^TZ)^{-1}v(\beta)\|^2 = O(K/n)\|v(\beta)\|^2$. Using the fact that $\partial\|b_j\|$
and $\partial\|b_j\|_c$ have $\ell_2$ norm bounded by 1, it easily follows from the definition of $v(\beta)$
(below equation (11)) that $\|v(\beta_0)\|^2 = O(n^2(\lambda_1^2\|w_1'\|^2 + \lambda_2^2\|w_2'\|^2))$.
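(iii) We have

$$\begin{aligned}
& n\lambda_1\sum_{j=1}^{p_1} w_{1j}\|\hat b_j(\beta) - \hat b_j(\beta_0)\| \\
\le{}& n\lambda_1\|w_1'\| \cdot \|\hat b(\beta) - \hat b(\beta_0)\| \\
\le{}& n\lambda_1\|w_1'\| \cdot (\|(Z^TZ)^{-1}Z^TX(\beta - \beta_0)\| + \|(Z^TZ)^{-1}(v(\beta) - v(\beta_0))\|) \\
={}& n\lambda_1\|w_1'\| \cdot O(\gamma_1\sqrt{K} + K(\lambda_1\|w_1'\| + \lambda_2\|w_2'\|)) \\
={}& O(n\sqrt{K}\lambda_1\|w_1'\|\gamma_1 + nK(\lambda_1^2\|w_1'\|^2 + \lambda_2^2\|w_2'\|^2)),
\end{aligned}$$

where in the 2nd line above we used Cauchy-Schwarz inequality, in the 3rd line
we used (11), in the 4th line we used part (ii) of this Lemma. We can bound
$n\lambda_2\sum_{j=1}^{p_1} w_{2j}\|\hat b_j(\beta) - \hat b_j(\beta_0)\|_c$ in a similar way.
Finally,

$$\begin{aligned}
& n\lambda_1\sqrt{K}\sum_j\{w_{1j}(|\beta_j| - |\beta_{0j}|)\} \\
\le{}& n\lambda_1\sqrt{K}\sum_j\{w_{1j}|\beta_j - \beta_{0j}|\} \\
\le{}& n\lambda_1\sqrt{K}\|w_1'\|\gamma_1,
\end{aligned}$$

using Cauchy-Schwarz inequality in the last line above.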

(iv) Using the decomposition $\tilde X = \Gamma - P_Z\Gamma + \tilde G + U - P_ZU$ where $\Gamma = \Theta - G$, we
have that

..(T^Unr^U) ^„ oi^)^o(l). (24) 



n y/n 

since each entry of $(\Gamma + U)^T(\Gamma + U)/n - \Xi$ has mean zero and the above can be proved
by calculating the variance of each entry (this is just a standard way of proving the
weak law of large numbers).

We also have the following bounds. 

F^PyF s s'^K 

n n n 

by that each entry of F is orthogonal to Q and entries of Z are in Q. 

\\'^^\\ = 0C-tr{Pz)) = 0C-^l (26) 
n n n 

by a similar reason as before. 
\\^-^\\ = 0{4m) (27)

by condition (c7). 

Other terms in \\X'^X/n — S|| can be bounded by Cauchy-Schwartz inequality 
utilizing (|2^-( |2711 . resulting in some additional o(l) terms, and part (iv) of the Lemma 
is proved. 